CAMBRIDGE GARMENTS Industry (CGI) is a leading fast-fashion retailer in Pakistan that deals in traditional and western clothing styles.
The company has a presence in every major city of Pakistan, operating most of its stores itself under the name The Cambridge Shop (TCS).
With its craftsmanship and uncompromising quality, CGI has captured the pulse of white-collar Pakistani men, who generally like to wear quality cotton shirts.
We obtained raw sales data for CGI men's shirts, which is typically used by the Supply Chain department for sales tracking and stock-movement analysis.
The data contains several numerical and many categorical variables.
The data is invoice-based: each record represents one line of a transaction, i.e. the value of a particular shirt brand sold.
To use this data for Exploratory Data Analysis (EDA) and Machine Learning (ML), we first need to clean it and adapt it to our requirements.
The initial dataset contains 45 columns and roughly 670,000 records, where each record is identified by a “BillNo”.
Each record describes a shirt sold at TCS under a particular “cobrand”, which results in multiple records with the same “BillNo” but different “cobrands”.
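This line-item structure can be confirmed with a quick groupby; the toy frame below is hypothetical, with column names taken from the dataset but made-up values:

```python
import pandas as pd

# Toy frame mimicking the invoice structure; values are for illustration only.
toy = pd.DataFrame({
    "BillNo":  ["B1", "B1", "B2"],
    "CoBrand": ["Classic", "Slim", "Classic"],
    "Qty":     [1, 2, 1],
})
# One bill can appear once per cobrand, so bills map to one or more rows.
lines_per_bill = toy.groupby("BillNo").size()
print(lines_per_bill["B1"])  # 2: bill B1 has two line items, one per cobrand
```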
Steps followed in the project
It is important to name variables according to the data they hold and to eliminate columns that duplicate information. This is perhaps the most important step when wrangling real data.
Our EDA includes renaming variables, eliminating variables, and checking, through various graphs and charts, whether they represent the data accordingly.
Lastly, it is important to check the summary statistics of the data.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
from sklearn.impute import SimpleImputer

df1 = pd.read_excel('/Users/snawaz/Documents/pychilla2/teamproject_sep3/Deep_note_linked/Sales.xlsx')
df = df1.copy()
rows,cols=df.shape
print("Number of rows in dataset are",rows)## Number of rows in dataset are 670082
print("Number of columns in the dataset are",cols)## Number of columns in the dataset are 45
df.info()## <class 'pandas.core.frame.DataFrame'>
## RangeIndex: 670082 entries, 0 to 670081
## Data columns (total 45 columns):
## # Column Non-Null Count Dtype
## --- ------ -------------- -----
## 0 BillNo 670082 non-null object
## 1 BillDate 670082 non-null datetime64[ns]
## 2 LoyaltyCard 33720 non-null object
## 3 Customer 661558 non-null object
## 4 Description 66014 non-null object
## 5 BillMonth 670082 non-null object
## 6 Warehouse 670082 non-null object
## 7 RegionName 670082 non-null object
## 8 Location 670082 non-null object
## 9 Category 670082 non-null object
## 10 DepartmentName 670082 non-null object
## 11 BrandName 668530 non-null object
## 12 CoBrand 670082 non-null object
## 13 Barcode 670082 non-null int64
## 14 DesignNo 670082 non-null object
## 15 Rejection 670082 non-null object
## 16 SeasonName 670082 non-null object
## 17 Attribute1 670057 non-null object
## 18 Attribute2 143125 non-null object
## 19 Attribute3 309262 non-null object
## 20 Attribute4 670082 non-null object
## 21 Attribute5 337322 non-null object
## 22 Attribute6 0 non-null float64
## 23 Attribute7 0 non-null float64
## 24 Attribute8 490234 non-null object
## 25 LocalImport 670082 non-null object
## 26 Color 670082 non-null object
## 27 Sizes 670082 non-null object
## 28 DiscountType 502886 non-null object
## 29 SalesmanName 670082 non-null object
## 30 Qty 670082 non-null int64
## 31 SalesReturnReason 14786 non-null object
## 32 Price 670082 non-null int64
## 33 Amount 670082 non-null int64
## 34 SaleExclGST 670082 non-null float64
## 35 GSTP 670082 non-null int64
## 36 GST 670082 non-null int64
## 37 DiscPer 670082 non-null float64
## 38 DiscAmount 670082 non-null float64
## 39 BarcodeDiscPer 670082 non-null int64
## 40 BarcodeDiscount 670082 non-null int64
## 41 NetAmount 670082 non-null float64
## 42 PointsEarned 670082 non-null int64
## 43 TaxPer 670082 non-null int64
## 44 Cobrand Acc 670082 non-null object
## dtypes: datetime64[ns](1), float64(6), int64(10), object(28)
## memory usage: 230.1+ MB
DT::datatable(head(py$df,20), options = list(pageLength = 5, scrollX = T))

Since the Attribute4 column contains either repetitive values or NaNs, we will drop it. First, the percentage of missing values per column:
## percent_missing
## BillNo 0.000000
## BillDate 0.000000
## LoyaltyCard 94.967780
## Customer 1.272083
## Description 90.148370
## BillMonth 0.000000
## Warehouse 0.000000
## RegionName 0.000000
## Location 0.000000
## Category 0.000000
## DepartmentName 0.000000
## BrandName 0.231613
## CoBrand 0.000000
## Barcode 0.000000
## DesignNo 0.000000
## Rejection 0.000000
## SeasonName 0.000000
## Attribute1 0.003731
## Attribute2 78.640674
## Attribute3 53.847141
## Attribute4 0.000000
## Attribute5 49.659594
## Attribute6 100.000000
## Attribute7 100.000000
## Attribute8 26.839700
## LocalImport 0.000000
## Color 0.000000
## Sizes 0.000000
## DiscountType 24.951573
## SalesmanName 0.000000
## Qty 0.000000
## SalesReturnReason 97.793404
## Price 0.000000
## Amount 0.000000
## SaleExclGST 0.000000
## GSTP 0.000000
## GST 0.000000
## DiscPer 0.000000
## DiscAmount 0.000000
## BarcodeDiscPer 0.000000
## BarcodeDiscount 0.000000
## NetAmount 0.000000
## PointsEarned 0.000000
## TaxPer 0.000000
## Cobrand Acc 0.000000
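The missing-value percentages above can be reproduced with a one-line pandas expression (a sketch; the project may have computed it slightly differently). The `demo` frame stands in for `df`:

```python
import numpy as np
import pandas as pd

# Small frame standing in for df, just to show the computation.
demo = pd.DataFrame({"LoyaltyCard": [np.nan, "L1", np.nan, np.nan],
                     "BillNo": ["B1", "B2", "B3", "B4"]})
# isnull() gives a boolean frame; the column mean is the fraction of NaNs.
percent_missing = demo.isnull().mean() * 100
print(percent_missing)  # LoyaltyCard 75.0, BillNo 0.0
```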
import missingno as msno
msno.matrix(df)

Columns Attribute6 and Attribute7 have 100% missing values, so they are dropped. There is also no description available for the Attribute5 column.
df.drop(['Attribute6','Attribute7'],axis=1,inplace=True)
df.Attribute4.unique()## array(['CAMBRIDGE'], dtype=object)
The SalesmanName column is not a useful feature: to avoid bias in decision making, we do not want to identify sales by individual salespeople. Attribute4 holds a single value ('CAMBRIDGE'), so it is dropped as well, along with Category.
df.drop('Attribute4',axis=1,inplace=True)
df.drop('SalesmanName',axis=1,inplace=True)
df.drop('Category',axis=1,inplace=True)

The data in columns CoBrand and Cobrand Acc is 99% similar, so we keep only one of the two.
from fuzzywuzzy import fuzz
fuzz.token_sort_ratio(df['CoBrand'], df['Cobrand Acc'])
df.drop('CoBrand',axis=1,inplace=True)
df.drop('Barcode',axis=1,inplace=True)

Based on the data description, we rename some columns to make them more readable.
df = df.rename(columns={'Attribute1':'Inventory_status',
                        'Attribute2':'Offers','Attribute8':'Class_of_cloth',
                        'Attribute3':'Import_type'})

Next, we distinguish returning from non-returning customers. We identify customers by name and assume that rows sharing an identical name belong to a returning customer. Later graphs will show whether returning customers are more profitable than non-returning ones.
df['Customer']=df.Customer.duplicated()
df['Customer'].replace(True,'Returning_Costumer',inplace=True) #<<
df['Customer'].replace(False,'NoN_Returning_Costumer',inplace=True) #<<
df.Customer.value_counts()## Returning_Costumer 596405
## NoN_Returning_Costumer 73677
## Name: Customer, dtype: int64
For LoyaltyCard we use the same duplicated() trick, marking repeated entries as Yes and first occurrences as No.

df['LoyaltyCard']=df.LoyaltyCard.duplicated()
df['LoyaltyCard'].replace(True,'Yes',inplace=True)
df['LoyaltyCard'].replace(False,'No',inplace=True)

We used the sklearn SimpleImputer with the most-frequent strategy to fill the remaining missing values.
imp = SimpleImputer(strategy="most_frequent")
for col in ['Inventory_status','Offers','Import_type','Attribute5','BrandName','Description']:
    df[col] = imp.fit_transform(df[[col]])
df.SalesReturnReason = df.SalesReturnReason.replace(np.nan,'No information available')
df.Class_of_cloth = df.Class_of_cloth.replace(np.nan, 'No information')  # replacing NaN values with a placeholder
df.DiscountType = df.DiscountType.replace(np.nan, 'No Discount')  # replacing NaN values with 'No Discount'

df.describe().T##                      count         mean          std  ...     50%     75%       max
## Qty 670082.0 0.899518 0.611418 ... 1.0 1.0 45.0
## Price 670082.0 2437.725566 714.776793 ... 2259.0 2376.0 6127.0
## Amount 670082.0 2173.785338 1621.785795 ... 2259.0 2376.0 106920.0
## SaleExclGST 670082.0 1740.625980 1489.339984 ... 1663.0 2186.0 77455.0
## GSTP 670082.0 6.092880 2.285465 ... 6.0 6.0 17.0
## GST 670082.0 109.105263 120.435080 ... 95.0 125.0 4647.0
## DiscPer 670082.0 0.382138 3.200854 ... 0.0 0.0 100.0
## DiscAmount 670082.0 8.756731 95.321598 ... 0.0 0.0 14830.0
## BarcodeDiscPer 670082.0 17.706990 20.796915 ... 20.0 30.0 60.0
## BarcodeDiscount 670082.0 423.838609 536.809144 ... 484.0 713.0 32085.0
## NetAmount 670082.0 1849.731244 1581.756942 ... 1747.0 2295.0 82102.0
## PointsEarned 670082.0 1.078650 6.030229 ... 0.0 0.0 821.0
## TaxPer 670082.0 6.359038 0.973759 ... 6.0 6.0 9.0
##
## [13 rows x 8 columns]
Next, we checked the Pearson correlation and visualized it through a heat map. This shows which variables are highly correlated; in this case, “Qty” is highly correlated with “NetAmount”, “SaleExclGST” and “GST”.
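The heat map itself is a one-liner with seaborn; the sketch below uses a toy frame (column names from the dataset, synthetic values) since the full data is not available:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
qty = rng.integers(1, 4, size=200)
price = rng.normal(2400, 700, size=200)
toy = pd.DataFrame({
    "Qty": qty,
    "Price": price,
    "NetAmount": qty * price * 0.9,  # correlated with Qty by construction
})
corr = toy.corr(method="pearson")
sns.heatmap(corr, annot=True, cmap="coolwarm")
plt.title("Pearson correlation heat map")
# plt.show()  # uncomment in an interactive session
```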
Now we will treat the outliers and apply some statistical tests.
The numerical columns contain many outliers, as boxplots show. We will use the IQR method to remove them.
# Library for skewness
from scipy.stats import skew
print('Skew of Price',skew(df.Price))## Skew of Price 1.7332960811793
print('Skew of Amount',skew(df.Amount))## Skew of Amount 0.5264229479147174
print('Skew of SaleExclGST',skew(df.SaleExclGST))## Skew of SaleExclGST 0.053907402377267245
print('Skew of GSTP',skew(df.GSTP))## Skew of GSTP 3.422663381156023
print('Skew of GST',skew(df.GST))## Skew of GST 1.8886742450354679
print('Skew of BarcodeDiscPer',skew(df.BarcodeDiscPer))## Skew of BarcodeDiscPer -0.11344278021015818
print('Skew of BarcodeDiscount',skew(df.BarcodeDiscount))## Skew of BarcodeDiscount 1.639017468026673
print('Skew of NetAmount',skew(df.NetAmount))## Skew of NetAmount 0.11210500286062883
print('Skew of PointsEarned',skew(df.PointsEarned))## Skew of PointsEarned 21.43903163597488
Now we remove the outliers: strongly right-skewed columns are imputed with the median, while ‘Price’ and the near-symmetric columns are imputed with the mean.
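The impute_outliers_IQR_* helpers used below are not defined in the excerpt; here is a minimal sketch consistent with how they are called (values outside the 1.5×IQR fences are replaced by the column median or mean):

```python
import pandas as pd

def impute_outliers_IQR_with_median(s):
    """Replace values outside the 1.5*IQR fences with the series median."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.where(s.between(lower, upper), s.median())

def impute_outliers_IQR_with_mean(s):
    """Replace values outside the 1.5*IQR fences with the series mean."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s.where(s.between(lower, upper), s.mean())

s = pd.Series([1, 2, 3, 4, 100])
print(impute_outliers_IQR_with_median(s))  # the 100 is replaced by the median (3)
```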
df['Qty'] = impute_outliers_IQR_with_median(df['Qty'])
df['Price'] = impute_outliers_IQR_with_mean(df['Price'])
df['Amount'] = impute_outliers_IQR_with_median(df['Amount']) #<<
df['SaleExclGST'] = impute_outliers_IQR_with_median(df['SaleExclGST'])
df['GSTP'] = impute_outliers_IQR_with_median(df['GSTP'])
df['GST'] = impute_outliers_IQR_with_median(df['GST'])
df['BarcodeDiscPer'] = impute_outliers_IQR_with_mean(df['BarcodeDiscPer'])
df['BarcodeDiscount'] = impute_outliers_IQR_with_median(df['BarcodeDiscount'])
df['NetAmount'] = impute_outliers_IQR_with_mean(df['NetAmount'])
df['PointsEarned'] = impute_outliers_IQR_with_median(df['PointsEarned'])
df['DiscPer'] = impute_outliers_IQR_with_median(df['DiscPer'])
df['DiscAmount'] = impute_outliers_IQR_with_median(df['DiscAmount'])
We can apply the Shapiro-Wilk test to check whether the data is normally distributed.
# Shapiro-Wilk test
from scipy.stats import shapiro #<<
print('ShapiroTest of Price',shapiro(df.Price))## ShapiroTest of Price ShapiroResult(statistic=0.8969478011131287, pvalue=0.0)
##
## /opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:1760: UserWarning: p-value may not be accurate for N > 5000.
## warnings.warn("p-value may not be accurate for N > 5000.")
print('ShapiroTest of Amount',shapiro(df.Amount))## ShapiroTest of Amount ShapiroResult(statistic=0.908415675163269, pvalue=0.0)
print('ShapiroTest of SaleExclGST',shapiro(df.SaleExclGST))## ShapiroTest of SaleExclGST ShapiroResult(statistic=0.8960244059562683, pvalue=0.0)
print('ShapiroTest of GSTP',shapiro(df.GSTP))## ShapiroTest of GSTP ShapiroResult(statistic=0.6320035457611084, pvalue=0.0)
print('ShapiroTest of GST',shapiro(df.GST))## ShapiroTest of GST ShapiroResult(statistic=0.8931977152824402, pvalue=0.0)
print('ShapiroTest of BarcodeDiscPer',shapiro(df.BarcodeDiscPer))## ShapiroTest of BarcodeDiscPer ShapiroResult(statistic=0.8688030242919922, pvalue=0.0)
print('ShapiroTest of BarcodeDiscount',shapiro(df.BarcodeDiscount))## ShapiroTest of BarcodeDiscount ShapiroResult(statistic=0.8954393863677979, pvalue=0.0)
print('ShapiroTest of NetAmount',shapiro(df.NetAmount))## ShapiroTest of NetAmount ShapiroResult(statistic=0.9133129715919495, pvalue=0.0)
print('ShapiroTest of PointsEarned',shapiro(df.PointsEarned))## ShapiroTest of PointsEarned ShapiroResult(statistic=1.0, pvalue=1.0)
##
## /opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:1757: UserWarning: Input data for shapiro has range zero. The results may not be accurate.
## warnings.warn("Input data for shapiro has range zero. The results "
Several common normalization techniques could be applied here; we use the standard scaler, which transforms each numeric column to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler() #<<
cols = ['Qty','Price','Amount','SaleExclGST','GSTP','GST','BarcodeDiscPer','BarcodeDiscount','NetAmount','PointsEarned','DiscPer','DiscAmount']
df[cols] = scaler.fit_transform(df[cols])

Checking normality after normalization:
from scipy.stats import shapiro
print('ShapiroTest of Price',shapiro(df.Price))## ShapiroTest of Price ShapiroResult(statistic=0.8945704102516174, pvalue=0.0)
##
## /opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:1760: UserWarning: p-value may not be accurate for N > 5000.
## warnings.warn("p-value may not be accurate for N > 5000.")
print('ShapiroTest of Amount',shapiro(df.Amount))## ShapiroTest of Amount ShapiroResult(statistic=0.9033786058425903, pvalue=0.0)
print('ShapiroTest of SaleExclGST',shapiro(df.SaleExclGST))## ShapiroTest of SaleExclGST ShapiroResult(statistic=0.8960385918617249, pvalue=0.0)
print('ShapiroTest of GSTP',shapiro(df.GSTP))## ShapiroTest of GSTP ShapiroResult(statistic=0.6283426880836487, pvalue=0.0)
print('ShapiroTest of GST',shapiro(df.GST))## ShapiroTest of GST ShapiroResult(statistic=0.8964444398880005, pvalue=0.0)
print('ShapiroTest of BarcodeDiscPer',shapiro(df.BarcodeDiscPer))## ShapiroTest of BarcodeDiscPer ShapiroResult(statistic=0.8605928421020508, pvalue=0.0)
print('ShapiroTest of BarcodeDiscount',shapiro(df.BarcodeDiscount))## ShapiroTest of BarcodeDiscount ShapiroResult(statistic=0.8955459594726562, pvalue=0.0)
print('ShapiroTest of NetAmount',shapiro(df.NetAmount))## ShapiroTest of NetAmount ShapiroResult(statistic=0.9149541854858398, pvalue=0.0)
print('ShapiroTest of PointsEarned',shapiro(df.PointsEarned))## ShapiroTest of PointsEarned ShapiroResult(statistic=1.0, pvalue=1.0)
##
## /opt/anaconda3/lib/python3.9/site-packages/scipy/stats/morestats.py:1757: UserWarning: Input data for shapiro has range zero. The results may not be accurate.
## warnings.warn("Input data for shapiro has range zero. The results "
print('ShapiroTest of DiscPer',shapiro(df.DiscPer))## ShapiroTest of DiscPer ShapiroResult(statistic=1.0, pvalue=1.0)
print('ShapiroTest of DiscAmount',shapiro(df.DiscAmount))## ShapiroTest of DiscAmount ShapiroResult(statistic=1.0, pvalue=1.0)
We run a one-way ANOVA of SaleExclGST across regions (in R):
## Df Sum Sq Mean Sq F value Pr(>F)
## RegionName 2 0.0 0.0019590 9.678 6.27e-05 ***
## Residuals 607200 122.9 0.0002024
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We apply the Tukey HSD test to check which regions differ significantly from the others.
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = SaleExclGST ~ RegionName, data = Regions)
##
## $RegionName
## diff lwr upr p adj
## 2-LAHORE-1-KARACHI -1.866555e-04 -2.943515e-04 -7.895944e-05 0.0001439
## 3-NORTH -1-KARACHI 2.546619e-05 -8.458279e-05 1.355152e-04 0.8503820
## 3-NORTH -2-LAHORE 2.121216e-04 8.031625e-05 3.439270e-04 0.0004756
Results show that:
* Sales in Lahore are significantly different from sales in Karachi.
* Sales in North are significantly different from sales in Lahore.
* Sales in North and Karachi are not statistically different from each other at the 95% confidence level.
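The ANOVA above was run in R; the same one-way test can be reproduced in Python with scipy. The data below is synthetic (the Regions frame is not shown in the excerpt), chosen roughly on the scale of SaleExclGST:

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic per-region sales figures standing in for SaleExclGST by RegionName.
rng = np.random.default_rng(1)
karachi = rng.normal(1700, 300, size=500)
lahore = rng.normal(1600, 300, size=500)
north = rng.normal(1705, 300, size=500)

# One-way ANOVA: does at least one region mean differ from the others?
f_stat, p_val = f_oneway(karachi, lahore, north)
print(f"F = {f_stat:.2f}, p = {p_val:.3g}")  # a small p-value means the means differ
```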
ML has been applied in the form of regression, classification, clustering and deep learning models. “Price” is the dependent variable in every model except classification, where “Season” is the output variable. The idea behind modelling price is to understand customer behaviour with regard to price.
Dependent variables in the various ML algorithms will be:
* Price for linear regression, with other selected features as independent variables.
* Price for multiple linear regression, with all other variables as independent.
* Inventory_status and LocalImport, with all other variables as independent, in the classification models.
* Price in the deep learning model.
* SeasonName for the 2 clustering models.
One-hot encoding for 2 variables: Attribute8 (which is actually the class of cloth) and LocalImport.

from sklearn.preprocessing import OneHotEncoder
One_hot_encoded_data = df_Ml[['Offers']]
enc = OneHotEncoder()
enc_results = enc.fit_transform(One_hot_encoded_data)
enc_df = pd.DataFrame(enc_results.toarray(), columns=enc.categories_)

Classification with Inventory_status & LocalImport as targets; we take a sample of 10,000 rows to compare the models.

from sklearn.model_selection import train_test_split
df_M = df_Ml.sample(10000).reset_index(drop=True)
X = df_M.drop('LocalImport', axis=1)
y = df_M['LocalImport']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=44, shuffle=True)
final_clf = None
clf_names = ["Logistic Regression", "KNN(3)", "XGBoost Classifier", "Random forest classifier", "Decision Tree Classifier",
             "Gradient Boosting Classifier", "Support Vector Machine"]
classifiers = []
scores = []
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from xgboost import XGBClassifier

for i in range(10):
    tempscores = []
    # Logistic Regression
    lr_clf = LogisticRegression(n_jobs=-1)
    lr_clf.fit(X_train, y_train)
    tempscores.append((lr_clf.score(X_test, y_test))*100)
    # KNN, n_neighbors = 3
    knn3_clf = KNeighborsClassifier(n_neighbors=3, n_jobs=-1)
    knn3_clf.fit(X_train, y_train)
    tempscores.append((knn3_clf.score(X_test, y_test))*100)
    # XGBoost
    xgbc = XGBClassifier(n_jobs=-1, seed=41)
    xgbc.fit(X_train, y_train)
    tempscores.append((xgbc.score(X_test, y_test))*100)
    # Random Forest
    rf_clf = RandomForestClassifier(n_jobs=-1)
    rf_clf.fit(X_train, y_train)
    tempscores.append((rf_clf.score(X_test, y_test))*100)
    # Decision Tree
    dt_clf = DecisionTreeClassifier()
    dt_clf.fit(X_train, y_train)
    tempscores.append((dt_clf.score(X_test, y_test))*100)
    # Gradient Boosting
    gb_clf = GradientBoostingClassifier()
    gb_clf.fit(X_train, y_train)
    tempscores.append((gb_clf.score(X_test, y_test))*100)
    # SVM
    svm_clf = SVC(gamma="scale")
    svm_clf.fit(X_train, y_train)
    tempscores.append((svm_clf.score(X_test, y_test))*100)
    scores.append(tempscores)
scores = np.array(scores)
clfs = pd.DataFrame({"Classifier":clf_names})
for i in range(len(scores)):
    clfs['iteration' + str(i)] = scores[i].T
means = clfs.mean(axis = 1)
means = means.values.tolist()
clfs["Average"] = means
clfs.set_index("Classifier", inplace = True)
print("Accuracies : ")
clfs["Average"].head(10)
We now build the final classifier with LocalImport as the target variable.

X=df_M.drop('LocalImport',axis=1)
y=df_M['LocalImport']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=44, shuffle =True)
print('X_train shape is ' , X_train.shape)
print('X_test shape is ' , X_test.shape)
print('y_train shape is ' , y_train.shape)
print('y_test shape is ' , y_test.shape)

We select the best features for each model from the feature importances (RFE is an alternative).

model = GradientBoostingClassifier()
model.fit(X,y)
plt.style.use('ggplot')
plt.figure(figsize=(6,10))
feat_importances = pd.Series(model.feature_importances_, index=X.columns)
feat_importances.nlargest(50).plot(kind='barh')
plt.savefig("extra_tree.png",dpi=200)
plt.show()
Only 7 out of 38 features are good enough to train our final model, so we drop the rest below.

df_M = df_Ml[['BrandName','Attribute5','GSTP','SeasonName','BillMonth','Price','DesignNo','Import_type','LocalImport']]

We could also have selected features using Pearson's correlation coefficient values.
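A sketch of that alternative, correlation-based selection; the toy numeric frame and the 0.5 threshold are illustrative assumptions, not the project's actual values:

```python
import pandas as pd

toy = pd.DataFrame({
    "Price": [10, 20, 30, 40, 50],
    "GSTP":  [1, 2, 3, 4, 5],   # perfectly correlated with Price
    "Noise": [3, 1, 4, 1, 5],   # essentially unrelated
})
target = "Price"
# Absolute Pearson correlation of every feature with the target.
corr = toy.corr(numeric_only=True)[target].abs().drop(target)
selected = corr[corr > 0.5].index.tolist()  # keep features with |r| > 0.5
print(selected)  # ['GSTP']
```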
Next we make a dictionary of hyperparameters for the Gradient Boosting model and apply a grid-search cross-validation approach to select the best hyperparameters for our selected model.
#Creating Parameters
params = {
'learning_rate':[0.1,1],
'n_estimators':[5,9, 11,12,13,15,20,25,26,29,31,20,50,75,100],
'max_features':['auto','sqrt','log2'],
'criterion':['friedman_mse', 'squared_error', 'mse'],
'loss':['log_loss', 'deviance', 'exponential']
}
#Fitting the model
from sklearn.model_selection import GridSearchCV
rf = GradientBoostingClassifier()
grid = GridSearchCV(rf, params, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)
print(grid.best_params_)
print("Accuracy:"+ str(grid.best_score_))## {'criterion': 'friedman_mse', 'learning_rate': 1, 'loss': 'exponential', 'max_features': 'auto', 'n_estimators': 75}
## Accuracy:1.0

This gives us the best hyperparameters under the chosen scoring parameter (accuracy). Remember that the model is refit on the same X_train, y_train data.
# applying model with best hyperparameters
rf = GradientBoostingClassifier(criterion='friedman_mse', learning_rate=1, loss='exponential', max_features='auto', n_estimators=75)
rf.fit(X_train, y_train)
y_pred = rf.predict(X_test)
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
print("Confusion Matrix: ", confusion_matrix(y_test, y_pred))## Accuracy Score: 1.0
## Confusion Matrix: [[  9895      0]
##  [     0 191130]]
from sklearn.naive_bayes import GaussianNB
from yellowbrick.classifier import ClassificationReport
# Instantiate the classification model and visualizer
visualizer = ClassificationReport(rf)
visualizer.fit(X_train, y_train) # Fit the visualizer and the model
visualizer.score(X_test, y_test) # Evaluate the model on the test data
visualizer.show()
import pickle
pkl_filename = "localImport_model.pkl"
with open(pkl_filename, 'wb') as file:
pickle.dump(grid, file)
# Load from file
with open(pkl_filename, 'rb') as file:
pickle_model = pickle.load(file)
# Calculate the accuracy score and predict target values
score = pickle_model.score(X_test, y_test)
y_predict = pickle_model.predict(X_test)

Next we repeat the exercise with Inventory_status as the target feature.

## Best Params: {'criterion': 'entropy', 'max_features': None, 'n_estimators': 75, 'random_state': 1}
## Train MSE: 0.0 Test MSE: 0.146
modelkn = RandomForestClassifier(random_state=1,criterion= 'entropy' ,n_jobs=-1, n_estimators=75, max_features=None)
modelkn= modelkn.fit(X_train,y_train)
y_11 = modelkn.predict(X_test)

Model scores for inventory_status
Confusion matrix for inventory_status
Target features: the Price of the article, and the SaleExclGST earned by the company plus the NetAmount paid by the customer. In one regression model there is a single target feature; in the other there are two.
The only difference from the previous approach is the use of the RepeatedKFold cross-validation method to pick the best model. Model evaluation is done by R2 score and negative RMSE score. Again, we use a sample of 10,000 rows to compare the models, followed by the full sample to train the final model.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, AdaBoostRegressor, ExtraTreesRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR

models = {}
models['lr'] = LinearRegression()
models['dr'] = DecisionTreeRegressor()
models['rf'] = RandomForestRegressor()
models['kn'] = KNeighborsRegressor()
models['ad'] = AdaBoostRegressor()
models['ex'] = ExtraTreesRegressor()
models['sv'] = SVR()
from sklearn import model_selection
for model in models:
    cv = model_selection.RepeatedKFold(n_splits=100, n_repeats=1, random_state=1)
    n_scores = model_selection.cross_val_score(models[model], X, y, scoring='neg_root_mean_squared_error', cv=cv, n_jobs=-1)
    print(model, np.mean(n_scores), np.std(n_scores))## lr -0.015952386710040633 0.023420996302700597
## dr -0.0684330302042271 0.13595572185672142
## rf -0.05608366484049382 0.13108548370117068
## kn -0.953965008983964 0.20208368080610473
## ex -0.056439850981039486 0.19471193286786767
We picked Random Forest regressor for further analysis.
Feature selection with the RFECV method.
Regression model comparison for predicting price; the negative RMSE score is used as the evaluation metric.
RandomForestRegressor had the highest RMSE score, so we chose this model for further analysis.
Optimum features for price prediction are selected by running the RFECV method.
Best hyperparameters for price prediction are found with the GridSearchCV method:
## Best Params: {'criterion': 'squared_error', 'max_features': 'sqrt', 'random_state': 42, 'splitter': 'best'}
## Train MSE: 1.331754172731566e-06 Test MSE: 0.002528635026436538
modelkn = DecisionTreeRegressor(criterion= 'squared_error', max_features= 'sqrt', random_state= 42, splitter= 'best')
modelkn= modelkn.fit(X_train,y_train)
y_11 = modelkn.predict(X_test)

On the training set we get an R2 score of 1; the other scores are given below.
## R-squared score (training): 1.000 R-squared score (test): 0.997
## Test MSE: 0.002528635026436538 Test RMSE: 0.001264317513218269
For clustering we select the variable SeasonName and find the best number of clusters using the elbow method and the silhouette score.
# Use silhouette score
from sklearn.cluster import MiniBatchKMeans
from sklearn.metrics import silhouette_score
from yellowbrick.cluster import SilhouetteVisualizer

range_n_clusters = list(range(2, 10))
print("Number of clusters from 2 to 9: \n", range_n_clusters)
fig, ax = plt.subplots(3, 2, figsize=(15, 8))
for n_clusters in range_n_clusters:
    clusterer = MiniBatchKMeans(n_clusters=n_clusters)
    preds = clusterer.fit_predict(df_Ml)
    centers = clusterer.cluster_centers_
    q, mod = divmod(n_clusters, 2)
    score = silhouette_score(df_Ml, preds)
    print("For n_clusters = {}, silhouette score is {}".format(n_clusters, score))
    visualizer = SilhouetteVisualizer(clusterer, colors='yellowbrick', ax=ax[q-1][mod])
    visualizer.fit(df_Ml)

inertias = []
from sklearn.cluster import KMeans
for n_clusters in range(2, 15):
    km = KMeans(n_clusters=n_clusters).fit(df_Ml)
    inertias.append(km.inertia_)
plt.plot(range(2, 15), inertias, 'k')
plt.title("Inertia vs Number of Clusters")
plt.xlabel("Number of clusters")
plt.ylabel("Inertia")

# define the model
model = MiniBatchKMeans(n_clusters=6)
# fit the model
model.fit(df_Ml)
# assign a cluster to each example
yhat = model.predict(df_Ml)
# retrieve unique clusters
clusters = np.unique(yhat)
# convert to a NumPy array so rows and columns can be indexed positionally
X_arr = df_Ml.to_numpy()
# create scatter plot for samples from each cluster
for cluster in clusters:
    # get row indexes for samples with this cluster
    row_ix = np.where(yhat == cluster)
    # create scatter of these samples
    plt.scatter(X_arr[row_ix, 0], X_arr[row_ix, 1])
# show the plot
plt.show()

# define the model
from sklearn.cluster import DBSCAN
dbscan_model = DBSCAN(eps=0.25, min_samples=9)
# train the model and assign each data point to a cluster
# (DBSCAN has no predict method, so we use fit_predict)
dbscan_result = dbscan_model.fit_predict(df_Ml)
# get all of the unique clusters
dbscan_clusters = np.unique(dbscan_result)
# plot the DBSCAN clusters
X_arr = df_Ml.to_numpy()
for dbscan_cluster in dbscan_clusters:
    # get data points that fall in this cluster
    index = np.where(dbscan_result == dbscan_cluster)
    # make the plot
    plt.scatter(X_arr[index, 0], X_arr[index, 1])
# show the DBSCAN plot
plt.show()

Deep learning model for predicting price and comparison with regression models
We use a sequential model with 4 fully-connected hidden layers. ReLU is more popular in deep neural networks, but we use tanh activation on a trial basis. Sigmoid is almost never used because it is slow to train. Dropout layers are added to reduce overfitting. The model is compiled with the Adam optimizer and MSE loss.
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Dense, Activation, Dropout
from tensorflow.keras.optimizers import Adam
X_train = np.array(X_train)
X_test = np.array(X_test)
y_train = np.array(y_train)
y_test = np.array(y_test)
model = Sequential()
model.add(Dense(X_train.shape[1], activation='relu'))
model.add(Dense(32, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(64, activation='tanh'))
model.add(Dropout(0.2))
model.add(Dense(128, activation='tanh'))
# model.add(Dropout(0.2))
model.add(Dense(512, activation='tanh'))
model.add(Dropout(0.1))
model.add(Dense(1))
model.compile(optimizer=Adam(0.00001), loss='mse')
r = model.fit(X_train, y_train,
validation_data=(X_test,y_test),
batch_size=1,
              epochs=100)

Cambridge Industries maintains a database of its sales. All the information related to customer choices, such as loyalty cards, product exchanges and the type of product bought, is available in the database.
We performed EDA followed by data cleaning and feature engineering.
Statistical analysis shows a significant difference between the company's offline and online sales.
We took subsets of the data based on different products and locations, and a map was shown of where products are bought and delivered.
We used different machine learning models (regression, classification, clustering and a deep neural network) with output features Sales (company revenue), Inventory_status, SeasonName and Price respectively.